The Trykkefrihedensskrifter, or Freedom of the Press Writings, is a collection of small books (pamphlets) published during the press freedom period in Denmark between 1770 and 1773. Before that period, every book had to be approved by university professors before it could be published. Johann Friedrich Struensee’s reforms abolished this requirement, and people started to write and publish their thoughts in the form of small pamphlets. In these pamphlets they discussed everything from serious political, philosophical and economic treatises, over political commentary, criticism and satire, over essay writing, fiction and entertainment, to gossip, libel and pornography. Bolle Willum Luxdorph collected around 1000 of these pamphlets, which have now been digitized and made accessible by the Danish Royal Library. You can have a look at all the books here.
This was made possible by a book project: my colleague Frederik Stjernfelt wrote a massive book about the pamphlets and this period called Grov Konfækt, which received a lot of praise from the media.
Well, I have a passion for Digital Humanities, and getting insights into a large corpus of Danish pamphlets seemed like a very interesting project. One really crucial aspect of the 1000 books is that half of them are of unknown authorship. About a year ago I made the first efforts to find out who the authors of these anonymous books were. In academia this discipline is usually called stylometry, and it aims at solving exactly such authorship attribution problems. Back then I tried an approach called bootstrap consensus networks, though without much success; the results of those experiments can be found in this DHiNorden 2020 paper.
\(~\)
In a first step, I want to investigate whether purely lexical measures, such as vocabulary richness (e.g. the type-token ratio) or the share of long words, can tell us something about how different the styles of our authors are and how idiosyncrasies in writing style manifest themselves. Let’s look at some summary statistics of these measures for the ten most prolific authors in the dataset.
| author | num_books | mean_avg_sent_length | mean_book_token_count | mean_book_avg_token_length | mean_book_types | mean_type_token_ratio_book | mean_herdan_c |
|---|---|---|---|---|---|---|---|
| MartinBrun | 54 | 34.03 | 2323.98 | 4.53 | 1079.54 | 0.50 | 0.91 |
| J.C.Bie | 16 | 27.99 | 4213.31 | 4.69 | 1778.12 | 0.47 | 0.91 |
| J.L.Bynch | 16 | 35.83 | 6113.00 | 4.68 | 2300.19 | 0.45 | 0.90 |
| P.F.Suhm | 14 | 29.44 | 7794.50 | 4.68 | 2623.86 | 0.43 | 0.90 |
| SørenRosenlund | 14 | 34.81 | 6202.64 | 4.48 | 2155.29 | 0.39 | 0.89 |
| ChristianBagge | 11 | 41.45 | 2596.64 | 4.95 | 1147.00 | 0.48 | 0.90 |
| F.C.Scheffer | 9 | 24.23 | 3620.22 | 4.67 | 1565.78 | 0.46 | 0.90 |
| Chr.Thura | 6 | 48.51 | 11856.83 | 4.76 | 3123.83 | 0.38 | 0.89 |
| L.Jæger | 6 | 83.86 | 12752.00 | 4.71 | 3786.00 | 0.31 | 0.88 |
| O.D.Lütken | 6 | 32.53 | 10584.33 | 5.05 | 3236.33 | 0.36 | 0.89 |
\(~\)
Martin Brun wrote 54 pamphlets and is the most represented author. However, his books are rather short; the others, especially L.Jæger, wrote much longer books. The average token length is supposed to capture who tends to use long words, but no real differences become evident here. The type-token ratio (TTR) is, as already mentioned, a measure of vocabulary richness: the higher the ratio, the more unique words an author uses, which hints at a rich vocabulary. However, care must be taken, because this value is not normalized with respect to text length, which means that longer texts automatically have a lower TTR. In our case Brun seems to use a richer vocabulary than Thura, Lütken and Jæger, but this is probably just because their texts are longer. Herdan’s C is a length-normalized measure of vocabulary richness (the logarithm of the number of types divided by the logarithm of the number of tokens), and on this measure there are basically no differences. Finally, average sentence length differs strongly: Jæger writes very long sentences while Bie writes very short ones. This value is not very reliable, though, because the poor OCR does not reliably capture punctuation marks, which can lead to false sentence boundaries. To sum up, the lexical measures do not show strong differences between the most prolific authors, which also means that these authors might be difficult to distinguish when using such measures as features.
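The measures in the table can be sketched in a few lines. This is a minimal illustration with naive regex tokenization, not the exact pipeline that produced the table above:

```python
import math
import re

def lexical_profile(text):
    """Compute the lexical measures from the table for a single pamphlet.

    Sketch only: sentences are split on ., ! and ?, tokens on word
    characters, which is cruder than a proper tokenizer.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text.lower())
    n, v = len(tokens), len(set(tokens))
    return {
        "avg_sent_length": n / len(sentences),      # tokens per sentence
        "token_count": n,
        "avg_token_length": sum(len(t) for t in tokens) / n,
        "types": v,
        "type_token_ratio": v / n,                  # falls as texts get longer
        "herdan_c": math.log(v) / math.log(n),      # length-corrected richness
    }

profile = lexical_profile(
    "Kongen gav os trykkefrihed. Folket skrev. Folket skrev meget."
)
```

Running this on a real pamphlet makes the TTR caveat tangible: concatenating two pamphlets lowers the ratio even though the vocabulary has not become poorer, while Herdan’s C stays much more stable.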
\(~\)
For authorship attribution, Burrows suggests using the most frequent word types (MFW), as these very frequent items (which mainly correspond to function words) are used largely unconsciously by an author and are thus suitable for reflecting their style. Every author is then represented by a feature vector, or author profile, which can be used to calculate the distance to other authors or to single texts. A single text of unknown authorship lying very close to an author’s profile could be an indicator that this text was in fact written by that author. To do this I take the following steps:

1. Extract the 300 most frequent word uni-, bi- and trigrams from the corpus.
2. Compute the relative frequency of each of these items in every pamphlet.
3. Standardize each frequency across the corpus (z-scores), so that each text becomes a 300-dimensional vector.
4. Average the vectors of all known texts by an author to obtain that author’s profile.
These vectors can then be used to calculate the distance between pamphlets of unknown authorship and the author profiles. Moreover, we can check which authors have a similar style, i.e. a small Delta distance between each other; again, those authors might be difficult to distinguish from one another.
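The Delta distance used throughout this post can be sketched as follows. This is the classic Burrows' Delta (mean absolute difference of z-scored MFW frequencies), shown on an invented toy matrix rather than the real 300-feature corpus:

```python
import numpy as np

def burrows_delta(freqs):
    """Pairwise Burrows' Delta for a (texts x MFW) matrix of
    relative frequencies. Sketch of the classic measure."""
    # z-score each MFW column across the corpus
    z = (freqs - freqs.mean(axis=0)) / freqs.std(axis=0)
    # Delta(a, b) = mean absolute difference of the z-score vectors
    n = len(freqs)
    delta = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            delta[i, j] = np.abs(z[i] - z[j]).mean()
    return delta

# toy example: three texts over four MFW; rows 0 and 1 are stylistically close
freqs = np.array([
    [0.05, 0.03, 0.02, 0.01],
    [0.04, 0.03, 0.03, 0.01],
    [0.01, 0.06, 0.01, 0.04],
])
delta = burrows_delta(freqs)
```

The z-scoring is what makes Delta work: without it, the handful of extremely frequent words would dominate the distance and the rarer (but still frequent) function words would contribute almost nothing.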
Let’s first have a look at the pamphlet with ID 1.1.10, which, according to the bibliography, is of unknown authorship. Quite a few authors have similar Delta distances to this pamphlet, making it difficult to collect evidence for who the real author might be.
\(~\)
When looking at pamphlet 2.13.1, we see that there is actually quite a gap between the closest author, Bie, and the next closest one, Martin Brun. This might suggest that Bie is the real author of that book. However, we would have to study that in more detail.
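The "gap between the two closest candidates" idea can be captured in a tiny helper. The author names and distance values below are invented for illustration, not the actual Delta values for pamphlet 2.13.1:

```python
def rank_candidates(delta_to_unknown):
    """Rank candidate authors by Delta distance to an unknown pamphlet
    and report the gap between the two closest candidates.
    A large gap is (weak) evidence for the closest candidate."""
    ranked = sorted(delta_to_unknown.items(), key=lambda kv: kv[1])
    gap = ranked[1][1] - ranked[0][1]
    return ranked, gap

# invented distances, shaped like the 2.13.1 situation described above
ranked, gap = rank_candidates(
    {"J.C.Bie": 0.62, "MartinBrun": 1.10, "J.L.Bynch": 1.15}
)
```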
\(~\)
To collect even more evidence and sort out which pamphlets are worth investigating further, I want to visualize all author profiles and texts. We can’t just draw a regular scatterplot, since we have 300 dimensions from the MFW to plot rather than 2. So how can we solve that problem? We use an algorithm that reduces the 300 dimensions to 2, which we can then plot. In this case I used UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction), an algorithm that serves a similar purpose as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD), namely reducing our dimensions to make visualization easier.
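As a sketch of this reduction step, here is a truncated-SVD projection to 2 dimensions, one of the alternatives named above. The post’s plot actually uses UMAP; with the umap-learn package the equivalent call would be `umap.UMAP(n_components=2).fit_transform(X)` on the same matrix:

```python
import numpy as np

def project_to_2d(X):
    """Project an (items x 300 MFW) matrix down to 2 dimensions via
    truncated SVD, keeping the two directions of largest variance.
    Stand-in for the UMAP projection used in the interactive plot."""
    Xc = X - X.mean(axis=0)                        # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                           # coordinates on top 2 components

# random placeholder data in place of the real z-scored MFW matrix
rng = np.random.default_rng(42)
coords = project_to_2d(rng.normal(size=(20, 300)))
```

Unlike SVD/PCA, UMAP is nonlinear, which is why it often separates clusters more cleanly, at the price of the global layout being harder to interpret.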
This is again an interactive plot; feel free to explore all points. You can also hide or show certain points by clicking on the legend on the right. Since some points might be occluded, it is worth hiding the dots without author names to see that Bie and Bynch lie quite close together and are thus similar in style. In fact, when looking at the heatmap above (Step 2), Bie and Bynch are only 0.39 apart in Delta distance. However, Bie and Martin Brun are also only 0.50 apart, and Brun lies on the other side of the plot. How should this be interpreted?! I don’t know so far :). (One caveat worth keeping in mind: UMAP mainly preserves local neighborhoods, so large distances in the 2-D plot do not necessarily mirror the original Delta distances.)
\(~\)
Did the 300 uni-, bi- and trigram MFW feature vectors capture the style of our authors? Well… to some degree. However, one could certainly explore further options, like more features or adding punctuation. The aspects presented here will be the basis for collecting evidence on which pamphlets with unknown authorship can be matched with potential author profile candidates in a classic closed-set authorship attribution scenario. In that scenario, I will use various machine learning algorithms and features to perform multinomial text classification. Stay tuned for more… but for now I will wait until I get my hands on the improved set of OCRed books.